Global Deaths Due to Air Pollution

Elizabeth Bekele, Alison Cheek

2022-05-03

Introduction

Packages Required

#This will allow us to filter through our data 
library(tidyverse)
library(dplyr)
#This will help us plot figures to showcase our findings
library(ggplot2)
#This will help us organize and display our data as necessary 
library(knitr)
library(kableExtra)
#This expands our plot uses 
library(plotly)
#Scientific Notation Disabled 
options(scipen=999)

Pollution Data

Import the deaths-due-to-air-pollution data

deaths_df <- data.frame(read.csv("death-rates-from-air-pollution.csv"))

We are going to rename a few of the columns and glimpse the data

colnames(deaths_df) <- c("country", "acronym", "year", "total_deaths", "indoor_deaths", "outdoor_deaths", "ozone_deaths")

glimpse(deaths_df)
## Rows: 6,468
## Columns: 7
## $ country        <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanist…
## $ acronym        <chr> "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG",…
## $ year           <int> 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1…
## $ total_deaths   <dbl> 299.4773, 291.2780, 278.9631, 278.7908, 287.1629, 288.0…
## $ indoor_deaths  <dbl> 250.3629, 242.5751, 232.0439, 231.6481, 238.8372, 239.9…
## $ outdoor_deaths <dbl> 46.44659, 46.03384, 44.24377, 44.44015, 45.59433, 45.36…
## $ ozone_deaths   <dbl> 5.616442, 5.603960, 5.611822, 5.655266, 5.718922, 5.739…

Data Variables

Variables that interest us here include:

World Population Data

Now, let’s take a look at the population data.

world_pop <- read.csv("population_total_long.csv")
glimpse(world_pop)
## Rows: 12,595
## Columns: 3
## $ Country.Name <chr> "Aruba", "Afghanistan", "Angola", "Albania", "Andorra", "…
## $ Year         <int> 1960, 1960, 1960, 1960, 1960, 1960, 1960, 1960, 1960, 196…
## $ Count        <int> 54211, 8996973, 5454933, 1608800, 13411, 92418, 20481779,…

To get a general idea of ‘deaths-dataframe’ we made, let’s make a plots to see what’s happening. This is a plot of indoor x outdoor deaths around the world by country.

This is a mess, and so we chose two countries from each continent (a high-population and a low-population country) to graph.

Combine Data Sets

First let’s look at a table of the high and low populated countries using the world population data set.

## # A tibble: 6 × 3
## # Groups:   Year [1]
##   Country.Name   Year     Count
##   <chr>         <int>     <int>
## 1 Australia      1997  18517000
## 2 Brazil         1997 167209040
## 3 Germany        1997  82034771
## 4 Nigeria        1997 113457663
## 5 Pakistan       1997 131057431
## 6 United States  1997 272657000
## # A tibble: 6 × 3
## # Groups:   Year [1]
##   Country.Name  Year    Count
##   <chr>        <int>    <int>
## 1 Canada        1997 29905948
## 2 Chile         1997 14786220
## 3 Sri Lanka     1997 18470900
## 4 Malawi        1997 10264906
## 5 New Zealand   1997  3781300
## 6 Serbia        1997  7596501

Next, we are going to see the death count for high and low populated countries using the deaths dataframe.

## # A tibble: 6 × 7
## # Groups:   year [6]
##   country   acronym  year total_deaths indoor_deaths outdoor_deaths ozone_deaths
##   <chr>     <chr>   <int>        <dbl>         <dbl>          <dbl>        <dbl>
## 1 Australia AUS      1997         22.4         0.322           21.8        0.314
## 2 Australia AUS      1998         21.5         0.284           21.0        0.305
## 3 Australia AUS      1999         20.4         0.259           19.9        0.295
## 4 Australia AUS      2000         19.4         0.240           18.9        0.290
## 5 Australia AUS      2001         18.6         0.223           18.1        0.284
## 6 Australia AUS      2002         18.1         0.211           17.7        0.286
## # A tibble: 6 × 7
## # Groups:   year [6]
##   country acronym  year total_deaths indoor_deaths outdoor_deaths ozone_deaths
##   <chr>   <chr>   <int>        <dbl>         <dbl>          <dbl>        <dbl>
## 1 Canada  CAN      1997         21.9        0.0878           19.9         2.20
## 2 Canada  CAN      1998         21.7        0.0824           19.6         2.21
## 3 Canada  CAN      1999         21.2        0.0751           19.2         2.19
## 4 Canada  CAN      2000         20.3        0.0682           18.3         2.13
## 5 Canada  CAN      2001         19.8        0.0641           17.9         2.08
## 6 Canada  CAN      2002         19.5        0.0605           17.7         2.05

Lastly, we will join the population and and deaths with its respected country.

## # A tibble: 6 × 8
## # Groups:   year [6]
##   country   acronym  year total_deaths indoor_deaths outdoor_deaths ozone_deaths
##   <chr>     <chr>   <int>        <dbl>         <dbl>          <dbl>        <dbl>
## 1 Australia AUS      1997         22.4         0.322           21.8        0.314
## 2 Australia AUS      1998         21.5         0.284           21.0        0.305
## 3 Australia AUS      1999         20.4         0.259           19.9        0.295
## 4 Australia AUS      2000         19.4         0.240           18.9        0.290
## 5 Australia AUS      2001         18.6         0.223           18.1        0.284
## 6 Australia AUS      2002         18.1         0.211           17.7        0.286
## # … with 1 more variable: Count <int>
## # A tibble: 6 × 8
## # Groups:   year [6]
##   country acronym  year total_deaths indoor_deaths outdoor_deaths ozone_deaths
##   <chr>   <chr>   <int>        <dbl>         <dbl>          <dbl>        <dbl>
## 1 Canada  CAN      1997         21.9        0.0878           19.9         2.20
## 2 Canada  CAN      1998         21.7        0.0824           19.6         2.21
## 3 Canada  CAN      1999         21.2        0.0751           19.2         2.19
## 4 Canada  CAN      2000         20.3        0.0682           18.3         2.13
## 5 Canada  CAN      2001         19.8        0.0641           17.9         2.08
## 6 Canada  CAN      2002         19.5        0.0605           17.7         2.05
## # … with 1 more variable: Count <int>

Death Count

Which country has the highest death count?

Let’s make a table depicting the high and low populated countries and their respected death count due to pollution.

country average_death_high
Australia 17.76815
Brazil 48.42928
Germany 28.10988
Nigeria 112.30157
Pakistan 144.33463
United States 26.35827
country average_death_low
Canada 18.18542
Chile 36.51321
Malawi 147.77167
New Zealand 15.92536
Serbia 80.66558
Sri Lanka 69.60383

Here’s a graph to clearly visualize the previous table

Pollution Types

Which type of pollution has the greatest number of deaths?

## # A tibble: 6 × 4
##   country       avg_indoor avg_outdoor avg_ozone
##   <chr>              <dbl>       <dbl>     <dbl>
## 1 Australia          0.249        17.2     0.360
## 2 Brazil            19.4          26.8     2.74 
## 3 Germany            0.717        25.5     2.34 
## 4 Nigeria           75.9          35.2     2.12 
## 5 Pakistan          87.7          50.5    10.4  
## 6 United States      0.166        22.8     3.92
## # A tibble: 6 × 4
##   country     avg_indoor avg_outdoor avg_ozone
##   <chr>            <dbl>       <dbl>     <dbl>
## 1 Canada          0.0651        16.4    1.97  
## 2 Chile           8.69          27.2    0.850 
## 3 Malawi        132.            13.8    3.39  
## 4 New Zealand     0.291         15.6    0.0728
## 5 Serbia         35.9           42.7    2.94  
## 6 Sri Lanka      44.5           24.8    0.430

Pollution Over Time

Let’s look at the previous two decades and compare the death count Has there been a change?

This is the first decade 1996-2006
country High_Deaths_96 High_Deaths_01 High_Deaths_06
Australia 23.04465 18.58572 14.92239
Brazil 60.67757 49.46436 41.46829
Germany 34.72325 28.38756 23.83654
Nigeria 136.08978 123.05129 102.26653
Pakistan 155.42988 151.25352 146.09296
United States 29.99271 28.93114 25.93369
country Low_Deaths_96 Low_Deaths_01 Low_Deaths_06
Australia 22.18101 19.82451 14.92239
Brazil 46.36829 37.43188 41.46829
Germany 183.14179 165.41702 23.83654
Nigeria 93.44700 83.18333 102.26653
Pakistan 85.28997 72.16239 146.09296
United States 100.66078 95.27073 25.93369
This is the second decade 2007-2017
country High_Deaths_07 High_Deaths_12 High_Deaths_17
Australia 14.92140 12.65973 10.79595
Brazil 40.42460 35.39069 30.32108
Germany 23.45850 20.91536 19.82826
Nigeria 98.90306 84.22324 81.22147
Pakistan 143.81724 133.93887 123.21548
United States 25.11756 21.98194 18.82515
country Low_Deaths_07 Low_Deaths_12 Low_Deaths_17
Australia 16.93196 13.82968 10.79595
Brazil 30.53130 27.31475 30.32108
Germany 132.12253 116.27470 19.82826
Nigeria 76.65752 72.77354 81.22147
Pakistan 66.05987 59.22433 123.21548
United States 87.81178 79.49336 18.82515

Let’s graph the previous tables!

Which year had the worst indoor? Outdoor particulate? Outdoor ozone?

Which is worse - outdoor or indoor pollution?

First, we split the data into high and low population based on country

Low population = high population * .10

## # A tibble: 126 × 3
## # Groups:   Year [21]
##    Country.Name   Year     Count
##    <chr>         <int>     <int>
##  1 Australia      1997  18517000
##  2 Brazil         1997 167209040
##  3 Germany        1997  82034771
##  4 Nigeria        1997 113457663
##  5 Pakistan       1997 131057431
##  6 United States  1997 272657000
##  7 Australia      1998  18711000
##  8 Brazil         1998 169785250
##  9 Germany        1998  82047195
## 10 Nigeria        1998 116319759
## # … with 116 more rows
## # A tibble: 126 × 3
## # Groups:   Year [21]
##    Country.Name  Year    Count
##    <chr>        <int>    <int>
##  1 Canada        1997 29905948
##  2 Chile         1997 14786220
##  3 Sri Lanka     1997 18470900
##  4 Malawi        1997 10264906
##  5 New Zealand   1997  3781300
##  6 Serbia        1997  7596501
##  7 Canada        1998 30155173
##  8 Chile         1998 14977733
##  9 Sri Lanka     1998 18564599
## 10 Malawi        1998 10552338
## # … with 116 more rows
#Mean total deaths from 1996-2017 of high-population countries
deaths_highpop_countries <- deaths_df %>% 
  filter(country %in% c('United States', 'Brazil', 'Nigeria', 'Germany', 'Pakistan', 'Australia')) %>% 
  group_by(country) %>% 
  select(total_deaths) %>% 
  summarize(average_death_high = mean(total_deaths))
## Adding missing grouping variables: `country`
#Mean total deaths from 1990-2017 of high-population countries
deaths_lowpop_countries<- deaths_df %>% 
  filter(year> 1995 & country %in% c('Canada', 'Chile', 'Malawi', 'Serbia', 'Sri Lanka', 'New Zealand')) %>% 
  group_by(country) %>% 
  select(total_deaths) %>% 
  summarize(average_death_low = mean(total_deaths))
## Adding missing grouping variables: `country`
#death_lowpop_countries
kable(list(deaths_highpop_countries, deaths_lowpop_countries))
country average_death_high
Australia 17.76815
Brazil 48.42928
Germany 28.10988
Nigeria 112.30157
Pakistan 144.33463
United States 26.35827
country average_death_low
Canada 16.86963
Chile 32.58415
Malawi 140.50830
New Zealand 14.08771
Serbia 78.12194
Sri Lanka 65.51438
ggplot(deaths_highpop_countries)+
  geom_col(mapping = aes(x=country, y=average_death_high))+
             xlab("Country")+
             ylab("Average deaths (per 100,000)")+
             ggtitle("Average total deaths in high-population countries")+
  coord_flip()

ggplot(deaths_lowpop_countries)+
  geom_col(mapping = aes(x=country, y=average_death_low))+
             xlab("Country")+
             ylab("Average deaths (per 100,000)")+
             ggtitle("Average total deaths in low-population countries")+
  coord_flip()

This shows us the deaths due to pollution, but what about the average population of those countries at that time?

hp_countries_population <- world_pop %>% 
  filter(Country.Name %in% c('United States', 'Brazil', 'Nigeria', 'Germany', 'Pakistan', 'Australia'), Year > 1996) %>% 
  group_by(Country.Name) %>% 
  select(Count) %>% 
  summarize(average_population = mean(Count))
## Adding missing grouping variables: `Country.Name`
lp_countries_population <- world_pop %>% 
  filter(Country.Name %in% c('Canada', 'Chile', 'Malawi', 'Serbia', 'Sri Lanka', 'New Zealand'), Year > 1996) %>% 
  group_by(Country.Name) %>% 
  select(Count) %>% 
  summarize(average_population = mean(Count))
## Adding missing grouping variables: `Country.Name`
#Population Average Table
kable(list(hp_countries_population, lp_countries_population))
Country.Name average_population
Australia 21217772
Brazil 189132292
Germany 81914540
Nigeria 148549958
Pakistan 168525322
United States 300447600
Country.Name average_population
Canada 33029774
Chile 16555805
Malawi 13605376
New Zealand 4214995
Serbia 7345882
Sri Lanka 19824652
#Graph of Population Average
ggplot(hp_countries_population)+
  geom_col(mapping = aes(x=Country.Name, y=average_population))+
             xlab("Country")+
             ylab("Average Population")+
             ggtitle("Average high-population countries")+
  coord_flip()

ggplot(lp_countries_population)+
  geom_col(mapping = aes(x=Country.Name, y=average_population))+
             xlab("Country")+
             ylab("Average Population")+
             ggtitle("Average low-population countries")+
  coord_flip()

#Percentage of those affected 
affected_high<- joined_high %>% 
  mutate(percent_high = total_deaths/Count)

percent_high<- affected_high %>% ggplot(aes(x = year, y = percent_high, color = country))+
  geom_point() +
  labs(title = "Percentage of High Populated Countries Affected by Air Pollution") +
  xlab("Year")+
  ylab("Percent (total_deaths/Count)")
ggplotly(percent_high)
affected_low <- joined_low %>% 
  mutate(percent_low = total_deaths/Count)

percent_low <- affected_low %>% ggplot(aes(x = year, y = percent_low, color = country))+
  geom_point() +
  labs(title = "Percentage of Low Populated Countries Affected by Air Pollution") +
  xlab("Year")+
  ylab("Percent (total_deaths/Count)")
ggplotly(percent_low)

Summary